For an absorbing Markov chain with a reinforcement on each transition, Bertsekas (1995a) gives a simple
example where the function learned by TD(λ) depends on λ. Bertsekas showed that for λ=1 the
approximation is optimal with respect to a least-squares error of the value function, and that for λ=0 the
approximation obtained by the TD method is poor with respect to the same metric. With respect to the
error in the values, TD(1) approximates the function better than TD(0). However, with respect to the error
in the differences in the values, TD(0) approximates the function better than TD(1). TD(1) is better
than TD(0) only with respect to the former metric, not the latter. In addition, direct TD(λ) weights the
errors unequally, while residual gradient methods (Baird, 1995; Harmon, Baird, & Klopf, 1995) weight the
errors equally. For the case of control, a simple Markov decision process is presented for which direct
TD(0) and residual gradient TD(0) both learn the optimal policy, while TD(1) learns a suboptimal policy.
These results suggest that, for this example, the differences in state values are more significant than the
state values themselves, so TD(0) is preferable to TD(1).

1  Introduction 

Bertsekas  (1995a)  proposes  a  counterexample  to  the  use  of  temporal  difference  methods  for  approximating  value 
functions  in  the  context  of  Markov  chains  and  suggests  that  his  results  extend  to  the  domain  of  Markov  decision 
processes  as  well.  Bertsekas  uses  a  least-squares  error  with  respect  to  the  value  function  as  the  metric  for  evaluating 
the  functions  learned  by  TD(1)  and  TD(0).  In  the  counterexample,  the  function  learned  by  TD(0)  is  inferior  to  that 
learned  by  TD(1)  when  using  this  metric.  However,  we  observe  that  other  metrics  produce  different  results. 

Bertsekas’ examples are Markov chains with states 0,1,2,...,n. Each transition returns a reinforcement with the
exception of state 0, which is cost-free and absorbing and is eventually reached from every other state. For each
initial state x the objective is to estimate the expected total return V*(x) received when following a series of
transitions from state x to the terminal state.

Like Bertsekas, we use a linear function approximator of the form V(x,w)=xw to approximate the optimal value
function V*(x), where x is the state and w is a weight vector. Sutton’s TD(λ) method (1988) is a
gradient-descent-like algorithm for obtaining a suitable vector w after observing a large number of simulated
trajectories of the Markov chain. As Bertsekas (1995a) and Sutton (1988) point out, TD(1) can be considered a
stochastic gradient descent method for minimizing an expected value of the square of the error V*(x)-V(x,w).

On the other hand, TD(0) can be viewed as a stochastic gradient-descent-like method for minimizing an expected
value of the temporal difference error [r(x_t,x_{t+1}) + γV(x_{t+1},w)] - V(x_t,w). The TD(1) algorithm attempts
to approximate V* by finding a function that is a direct approximation of the value function for all x, while the
TD(0) algorithm attempts to approximate V* by finding a function whose differences in values approximate the
differences in values of the value function for all x, while simultaneously approximating the value of the terminal
state. For this discussion, it is useful to define an operator that represents the difference in value of two adjacent
states. We define this operator δ in equation (0). This operator is somewhat like a derivative or slope, particularly
when γ=1 (as is the case for the examples presented here), and is equivalent to the difference between the two sides
of the Bellman equation (Bertsekas, 1995b).

$$\delta V(x_t) = V(x_t) - \gamma V(x_{t+1}) \tag{0}$$

The TD(λ) update equation is defined as follows:

$$w = w + \alpha\,\bigl[r(x_t,x_{t+1}) + \gamma V(x_{t+1},w) - V(x_t,w)\bigr]\,\bigl[\lambda^{t-1}\nabla_w V(x_1,w) + \lambda^{t-2}\nabla_w V(x_2,w) + \cdots + \nabla_w V(x_t,w)\bigr] \tag{1}$$


Note that if λ=0 then equation (1) reduces to equation (2), which is the update equation for value iteration:

$$w = w + \alpha\,\bigl[r(x_t,x_{t+1}) + \gamma V(x_{t+1},w) - V(x_t,w)\bigr]\,\nabla_w V(x_t,w) \tag{2}$$
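The TD(λ) update for a linear approximator can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name td_lambda_episode, the step size, and the incrementally maintained trace (which equals the bracketed sum of gradients in the TD(λ) update) are assumptions made for the example.

```python
import numpy as np

def td_lambda_episode(w, states, rewards, lam, alpha=0.01, gamma=1.0):
    """Apply one trajectory of TD(lambda) updates to a linear V(x, w) = x . w.

    states  : feature vectors x_1, ..., x_T, x_{T+1} (the last one is terminal)
    rewards : r(x_t, x_{t+1}) for t = 1, ..., T
    """
    w = np.asarray(w, dtype=float).copy()
    trace = np.zeros_like(w)
    for t, r in enumerate(rewards):
        x_t = np.asarray(states[t], dtype=float)
        x_next = np.asarray(states[t + 1], dtype=float)
        # For a linear V the gradient with respect to w is the feature vector,
        # so the trace accumulates sum_k lam**(t-k) * x_k.
        trace = lam * trace + x_t
        td_error = r + gamma * (x_next @ w) - (x_t @ w)
        w += alpha * td_error * trace
    return w
```

With lam=0 the trace is just the current feature vector, so each step is the single-gradient update; with lam=1 every earlier gradient keeps its full weight.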

By solving the temporal difference error for r one can see that TD(0) attempts to find a function in which δV
approximates δV*. In a given state, there will be no change in the weight vector w if δV=δV*, assuming γ=1.

Noting that TD(1) finds a function that approximates the value function V* directly and TD(0) finds a function
V in which δV approximates δV*, we can better evaluate the empirical results Bertsekas presented.
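This step can be written out explicitly. The following is a sketch for a deterministic transition, under the assumption that the operator is defined as δV(x_t) = V(x_t) - γV(x_{t+1}); that sign convention, and the symbol e_t for the temporal difference error, are assumptions introduced here.

```latex
\begin{align*}
V^*(x_t) &= r(x_t,x_{t+1}) + \gamma V^*(x_{t+1})
  && \text{(Bellman equation, deterministic transition)} \\
\delta V^*(x_t) &= V^*(x_t) - \gamma V^*(x_{t+1}) = r(x_t,x_{t+1}) \\
e_t &= r(x_t,x_{t+1}) + \gamma V(x_{t+1},w) - V(x_t,w)
     = \delta V^*(x_t) - \delta V(x_t)
\end{align*}
```

Under these assumptions the temporal difference error is zero exactly when δV equals δV* in the visited state, so the update leaves w unchanged there.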

2  Markov  chains 

The following two examples are identical to the examples presented in Bertsekas (1995a). We use a linear
function approximator of the form V(x,w)=xw. The weight w was calculated exactly rather than approximated in
simulation. The state transitions and associated reinforcements are deterministic. From state x we move to state x-1
with a given reinforcement. All simulation runs start at state n and end at state 0 after visiting all the states n-1,
n-2,...,1 in succession. The temporal difference associated with the transition from x to x-1 is
r_x+V(x-1,w)-V(x,w)=r_x-w and the gradient is ∇V(x,w)=x.
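Because every state is visited exactly once per trajectory, the limiting weights can be computed in closed form. The sketch below assumes γ=1, that TD(1) converges to the least-squares fit of xw to V*(x) over the visited states, and that TD(0) converges to the point where the summed update (r_x - w)x vanishes. The function names are hypothetical.

```python
def chain_values(rewards):
    """V*(x) for x = 0..n, where rewards[x] is received on leaving state x
    (rewards[0] is unused because state 0 is absorbing and cost-free)."""
    values = [0.0]
    for x in range(1, len(rewards)):
        values.append(rewards[x] + values[x - 1])
    return values

def td1_weight(rewards):
    """Least-squares slope: minimizes sum_x (V*(x) - x*w)**2 over x = 1..n."""
    v = chain_values(rewards)
    n = len(rewards) - 1
    numerator = sum(x * v[x] for x in range(1, n + 1))
    denominator = sum(x * x for x in range(1, n + 1))
    return numerator / denominator

def td0_weight(rewards):
    """Solves sum_x (r_x - w) * x = 0, the stationary point of the TD(0) update."""
    n = len(rewards) - 1
    return sum(x * rewards[x] for x in range(1, n + 1)) / sum(range(1, n + 1))
```

For instance, with n=50 and a single unit reinforcement on the transition out of state 1 (all others 0), td1_weight gives 3/101 (about 0.0297), while td0_weight gives 1/1275 (about 0.00078).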

Example  1 

Figure 1 shows the value function V* and the functions learned by TD(1) and TD(0) for n=50 and for r_1=1, r_x=0
for all x≠1.


Figure 1: The optimal value function V* and the functions learned by TD(1) and TD(0), plotted against the state x, for
the case r_1=1, r_x=0 for all x≠1. The function learned by TD(1) is a better approximation to V* than is TD(0)
according to the 2-norm.

In Figure (1), we can see that TD(0) yields a poor approximation to the value function. However, TD(0) is not
trying to directly approximate the value function. Rather, TD(0) generated a function V_TD(0) for which δV_TD(0)
approximates δV*. If one uses the error in the values as a metric, TD(1) learns the better function. However, if one
uses the error in the weighted differences as a metric, TD(0) learns the better function. TD(λ) does not weight the
errors (differences) equally in all states. The TD(λ) algorithm weights the error in each state proportional to the
frequency with which that state is trained and the magnitude of ∇_w V(x) for the given state. TD(0) finds the best
Figure 3: The optimal value function V*, and the functions learned by TD(1), direct TD(0), and residual gradient TD(0) for
the case r_n=-(n-1), r_x=1 for all x≠n. The function learned by TD(1) is the best approximation to V* according to the
weighted or unweighted 2-norm.

In Figure (3) the TD(0) algorithm finds a function V_TD(0) for which δV_TD(0) approximates δV* and for which
earlier states (50,49,...) are given more weight than later states (...,2,1,0). As previously stated, TD(λ) does not
weight the errors equally in all states. The TD(λ) algorithm weights the error in each state proportional to the
frequency with which that state is trained and the magnitude of ∇_w V(x) for the given state. The temporal difference
error multiplied by the derivative of V(x) when x=50 yields a value that is much larger than in any other state and
influences the approximation by a correspondingly disproportionate amount. We stated earlier that TD(λ) is a
gradient-descent-like method. TD(λ) is not a true gradient descent algorithm for λ<1 because no single error
function, E, exists whose derivative, ∂E/∂w, is that found in the weight vector update equation used by the TD(λ)
algorithm given in equation (1). A class of algorithms that does perform gradient descent on a single error function,
residual algorithms (Baird, 1995; Harmon, Baird, & Klopf, 1995), weights each temporal difference error based
solely on the frequency with which it is trained and can be viewed as stochastic gradient descent on the mean squared
Bellman residual.

Residual algorithms should be used if one prefers that the weighting of states be a function of only the frequency
with which a state is trained, and not a function of both the frequency with which a state is trained and the derivative
of the value function with respect to the weights for the given state. Residual gradient TD(0) is an algorithm that
weights the temporal difference error in a given state based solely on how often that state is visited. This update is
presented in equation (6) and is the equivalent of residual gradient value iteration. The function learned by residual
gradient TD(0), for the case where n=50, r_n=-(n-1), and r_x=1 for all x≠n, is also presented in Figure (3).

$$w = w + \alpha\,\bigl[r(x_t,x_{t+1}) + \gamma V(x_{t+1},w) - V(x_t,w)\bigr]\,\bigl[\nabla_w V(x_t,w) - \gamma \nabla_w V(x_{t+1},w)\bigr] \tag{6}$$
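For the chain examples of section 2 the two updates can be compared directly. The following sketch assumes V(x,w)=xw, γ=1, and a transition from x to x-1 with reinforcement r_x. The symbol Δw for the change in w over one step is notation introduced here.

```latex
\begin{align*}
\text{direct TD(0):} \quad
  \Delta w &= \alpha\,(r_x - w)\,\nabla_w V(x,w) = \alpha\,(r_x - w)\,x \\
\text{residual gradient TD(0):} \quad
  \Delta w &= \alpha\,(r_x - w)\,\bigl[\nabla_w V(x,w) - \nabla_w V(x-1,w)\bigr]
            = \alpha\,(r_x - w)
\end{align*}
```

The factor x in the direct update is what gives states far from the terminal state more influence. The residual gradient update has no such factor, so under these assumptions each visited state contributes equally.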

Figure (4) shows δV* as well as δV_TD(0) and δV_RGTD(0) for this example. In Figure (4) we can see that
the function learned by direct TD(0) is disproportionately influenced by the δs in the higher (earlier) states, while
residual gradient TD(0) weights the temporal difference errors equally.
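The difference in weighting can be checked numerically. The sketch below assumes γ=1, V(x,w)=xw, one visit to each state per trajectory, and the reinforcements r_50=-49 and r_x=1 otherwise. The helper names are hypothetical, and the closed forms follow from setting the summed direct and residual gradient updates to zero.

```python
def direct_td0_weight(rewards):
    """Stationary point of sum_x (r_x - w) * x = 0 for x = 1..n."""
    n = len(rewards) - 1
    return sum(x * rewards[x] for x in range(1, n + 1)) / sum(range(1, n + 1))

def residual_td0_weight(rewards):
    """Stationary point of sum_x (r_x - w) = 0, i.e. the mean reinforcement."""
    n = len(rewards) - 1
    return sum(rewards[1:]) / n

n = 50
# rewards[x] is received on leaving state x; index 0 is the absorbing state.
rewards = [0.0] + [1.0] * (n - 1) + [-(n - 1.0)]
w_direct = direct_td0_weight(rewards)      # -1225/1275, about -0.96
w_residual = residual_td0_weight(rewards)  # 0.0
```

The large reinforcement at x=50 is multiplied by x=50 in the direct case, which drags the weight to about -0.96, whereas the residual gradient weight is the plain average of the reinforcements.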
Figure 4: δV* and the functions δV_TD(0) and δV_RGTD(0) learned by direct TD(0) and residual gradient TD(0),
respectively, for the case r_n=-(n-1), r_x=1 for all x≠n. δV*=-49 when x=50. δV_RGTD(0) is the best approximation
to δV* according to the unweighted 2-norm and δV_TD(0) is the best approximation according to the weighted 2-norm.


3  Markov  decision  processes 

In section (1) we demonstrated that TD(1) finds a function that approximates the value function directly and
direct TD(0) finds a function V for which δV approximates δV* and V approximates V* for the terminal states. In
section (2) we demonstrated that TD(λ) weights the temporal difference errors based on the derivative of the value
function with respect to the weights for the given state and the frequency with which that state is trained. We also
showed that residual gradient TD(λ) weights the errors based only on the frequency with which they are trained.
Bertsekas used a least-squares error criterion with respect to the value function for evaluating the utility of TD(λ) for
λ<1. Observing Figures (1) and (3), which show the learned functions alongside the optimal value function V*, we see
that TD(1) learns a better approximation of V* than direct TD(0) or residual gradient TD(0) according to the 2-norm.
Observing Figures (2) and (4), which show the differences in the values of successive states, we see that residual
gradient TD(0) and direct TD(0) learn functions that are better approximations with respect to the differences than
the function learned by TD(1). It is not obvious which metric to use when evaluating the utility of the various TD
methods. In the domain of Markov decision processes another metric can be defined: the sum of the reinforcements
when following a learned policy from a given state x. TD(1) finds a function that approximates the value function V*
directly and TD(0) finds a function V for which δV approximates δV*. Which of these two methods is best suited
for maximizing the sum of the reinforcements in a control context? Here we show that, in at least one case, both
the direct TD(0) and residual gradient TD(0) algorithms learn functions that yield optimal control policies, while
TD(1) learns a function that yields a suboptimal control policy.

Figure (5) depicts the following MDP. The state x is a two-element vector {x_1,x_2}. In state {0,0} there are two
possible actions: transition to state {0,1}, or transition to state {1,0}. Each action leads to a Markov chain. The
states succeeding state {0,0} are labeled [{0,1},{0,2},...,{0,n}] and [{1,0},{2,0},...,{n,0}]. We assume that states
{0,n} and {n,0} yield a return of 0 and are absorbing.
Figure 5: MDP with reinforcements r_{{0,0},{1,0}}=-1, r_{{0,0},{0,1}}=-2, r_{{99,0}}=-192, r_{{0,99}}=0, ...
∀x∉{{0,0},{99,0}}, and r_{{0,x}}=2, ∀x∉{{0,0},{0,99}}.

We use a linear approximation of the form V(x,w)=x_1w_1+x_2w_2. The optimal weights were calculated exactly.
For this MDP, direct TD(0) and residual gradient TD(0) learned functions that yield the optimal policy: when in state
{0,0}, transition to state {1,0}. TD(1) learned a function that yields the wrong policy: when in state {0,0},
transition to state {0,1}.
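The policy comparison amounts to a one-step lookahead from state {0,0}. A minimal sketch follows, assuming γ=1 and first-transition reinforcements of -1 (to {1,0}) and -2 (to {0,1}) as read from the Figure 5 caption. The function name and the weights in the usage line are hypothetical and are not the values learned in the paper.

```python
def greedy_first_action(w1, w2, r_to_10=-1.0, r_to_01=-2.0):
    """Return the successor of {0,0} preferred under V(x, w) = x1*w1 + x2*w2."""
    q_10 = r_to_10 + 1.0 * w1  # reinforcement plus V({1,0}, w)
    q_01 = r_to_01 + 1.0 * w2  # reinforcement plus V({0,1}, w)
    return (1, 0) if q_10 >= q_01 else (0, 1)

# Hypothetical weights, chosen only to exercise the function.
print(greedy_first_action(w1=2.0, w2=1.5))
```

Only the difference between the two successor values (offset by the two reinforcements) enters the decision, which is why a function with accurate value differences can yield the right policy even when its absolute values are poor.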

In Figures (6) and (7) it can be seen that TD(1) learned a function that is a better approximation of the value
function V* than did either direct TD(0) or residual gradient TD(0), yet TD(1) yields a policy that is inferior to the
policies generated when using either direct TD(0) or residual gradient TD(0). Also, in Figure (6) it can be seen that
the function learned by direct TD(0) was affected by the disproportionately large derivative in state {99,0}, while
residual gradient TD(0) equally weighted the temporal difference errors in all states. In Figure (7) it can be seen that
the functions provided by direct TD(0) and residual gradient TD(0) are almost identical. In this case, the derivative in
state {0,99} is 0 for direct TD(0), and therefore had little effect on the approximation. In Figure (7), it is clear that
direct TD(0) and residual gradient TD(0) generated functions δV_TD(0) and δV_RGTD(0) that closely approximated δV*.


Figure 6: The value function V* for states {n,0} where n=0...100, and the functions learned by TD(1), direct
TD(0), and residual gradient TD(0). TD(1) finds the best 2-norm fit to V*, but doesn’t find the optimal policy.

Figure 7: The value function V* for states {0,n} where n=0...100, and the functions learned by TD(1), direct
TD(0), and residual gradient TD(0). TD(1) finds the best 2-norm fit to V*, but doesn’t find the optimal policy.

4  Conclusion

In Bertsekas’ example TD(0) appeared worse than TD(1), but that is only the case when considering the 2-norm
of the value function error. With respect to the difference in values, TD(0) appears better than TD(1). These
“differences” represent the degree to which the value function fails to satisfy the Bellman equation. It is not clear
which metric should be used. For our example, the salient information associated with a state is not the accuracy of
the approximation to the value function, but the accuracy of the approximation of the difference in the values of
adjacent states. In other words, the information contained in the difference in the values of adjacent states is most
relevant for making control decisions. TD(1) attempts to approximate the absolute value of each state. Both direct
and residual gradient TD(0) attempt to find approximations of the optimal value function by learning a function
whose differences approximate the differences of the optimal value function while simultaneously approximating the
value of the terminal states. However, the function learned by direct TD(0) is also affected by a disproportionate
weighting of the temporal difference errors in different states. Direct TD(0) weights each state proportional to the
frequency with which it is trained and the magnitude of ∇_w V(x). TD(1) and residual gradient TD(0) weight each
state based solely on the frequency with which it is trained.